Statistical Agenda Parsing
Abstract
This paper presents the results of converting a standard Graham/Harrison/Ruzzo (GHR) parser for a unification grammar into an agenda-driven parsing system. The agenda is controlled by statistical measures of grammar-rule likelihood obtained from a training set. The techniques in the agenda parser lead to substantial reductions in chart size and parse time, and can be applied to any chart-based parsing algorithm without hand-tuning.

INTRODUCTION

In a Graham/Harrison/Ruzzo (GHR) parser, the chart maintains a record of the syntactic constituents that have been found (terms) and the grammatical rules that have been partially matched (dotted rules). Parsing strategies such as GHR, CKY and other algorithms can be viewed as methodical ways of filling the chart that are guaranteed to explore all possible extensions of dotted rules by terms. An agenda is an alternative chart-filling algorithm whose goal is to find some term covering the entire input without necessarily filling in all of the chart.

If terms can be ranked by "goodness" and the grammar can produce multiple analyses of a given string, then one goal for an agenda is to produce the "best" parse first. The alternative goal we have chosen for DELPHI is to use the agenda mechanism to reduce the search necessary to produce ACCEPTABLE (see below) parses. This results in sparsely populated charts, approaching the extreme (and probably unattainable) goal of deterministic parsing, in which the only terms and dotted rules entered into the chart are those that appear as parts of the final parse.

The techniques involved in statistical agenda parsing allow "low probability" rules to be added to a grammar without significant cost in terms of either erroneous parses or increased parse time. These low probability rules greatly increase the coverage and robustness of the system by accounting for unusual or marginal constructions.
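The contrast between methodical chart filling and agenda-driven filling can be illustrated with a minimal Python sketch. It is not the DELPHI implementation: it assumes a toy grammar with binary rules only, tracks complete terms but not dotted rules, and uses made-up scores in place of trained statistics. The function `agenda_parse` and its arguments are hypothetical names chosen for the illustration.

```python
import heapq
from itertools import count


def agenda_parse(words, lexicon, rules, goal="S"):
    """Best-first chart filling that stops at the first spanning goal term.

    lexicon: word -> list of (category, score)      (hypothetical scores)
    rules:   list of (parent, left, right, score)   (binary rules only)
    Scores lie in (0, 1]; a higher score is scheduled earlier.
    """
    n = len(words)
    chart = {}       # (category, start, end) -> score of the inserted term
    agenda = []      # heapq is a min-heap, so scores are stored negated
    tie = count()    # tie-breaker keeps heap entries comparable

    def schedule(term, score):
        heapq.heappush(agenda, (-score, next(tie), term))

    for i, word in enumerate(words):
        for cat, score in lexicon.get(word, []):
            schedule((cat, i, i + 1), score)

    while agenda:
        neg_score, _, term = heapq.heappop(agenda)
        if term in chart:
            continue                     # already inserted earlier
        score = -neg_score
        chart[term] = score
        cat, i, j = term
        if cat == goal and i == 0 and j == n:
            return term, chart           # stop without filling the whole chart
        # Combine the new term with adjacent terms already in the chart.
        for (ocat, oi, oj), oscore in list(chart.items()):
            for parent, left, right, rscore in rules:
                if oj == i and (left, right) == (ocat, cat):
                    schedule((parent, oi, j), oscore * score * rscore)
                if j == oi and (left, right) == (cat, ocat):
                    schedule((parent, i, oj), score * oscore * rscore)
    return None, chart


lexicon = {
    "the": [("Det", 1.0)],
    "dog": [("N", 1.0), ("V", 0.1)],   # the V reading is a low-scoring distractor
    "barks": [("VP", 1.0)],
}
rules = [("NP", "Det", "N", 0.9), ("S", "NP", "VP", 1.0)]
result, chart = agenda_parse("the dog barks".split(), lexicon, rules)
print(result, len(chart))
```

In this toy run the low-scoring verb reading of "dog" is still on the agenda when the spanning term is found, so it never enters the chart. That is the sparse-chart effect the introduction describes, on a very small scale.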
DELPHI AGENDA PARSING

Most techniques for search space reduction involve careful tuning of the grammar or the parsing mechanism. This is very labor intensive and can place limits on the grammatical coverage of the system (Abney 1990). Our approach is to use an automated statistical technique for ranking rules based on their use in parsing a training set with the same grammar (under the control of an all-paths GHR parser without human supervision). This approach also allows us to include grammatical rules that are of use only rarely, or in specialized domains, and to learn how applicable they are to a body of sentences. To take into account general linguistic tendencies, we augment the statistical ranking with a small number of general agenda ordering strategies.

The DELPHI agenda mechanism is based on three "schedulable" action types:

1. the insertion of a term into the chart,
2. the insertion of a dotted rule into the chart, and
3. the (conditional) "pair extension" of a dotted rule by a term.

In principle one would like to order those actions in terms of the probability that they lead to a final parse. The initial implementation of the agenda mechanism uses an approximation to this ordering.

USE OF STATISTICAL MEASURES

There are two types of measures that one might estimate to help the agenda parsing mechanism: (1) category expansion probabilities and (2) rule success probabilities.

Category Expansion Probabilities

Category expansion probabilities are perhaps the more obvious of the two measures. The goal is to determine the probability that a given syntactic category (e.g., NP) is expanded by a given grammar rule in a valid parse. These probabilities allow one to estimate the probability that a given tree is the expansion of a given category. Bayes' rule may be used to calculate the relative probabilities of various parse trees for a specified input string.
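As an illustration of category expansion probabilities, the sketch below estimates them by relative frequency from a set of parse trees and multiplies them to score a tree. The relative-frequency estimator, the tuple encoding of trees, and the function names are assumptions made for this example; the text above does not specify how the probabilities are estimated.

```python
from collections import Counter


def rules_in(tree):
    """Yield (category, right-hand side) for every expansion in a tree.

    A tree is a tuple (category, child, ...); a leaf is a plain string.
    """
    if isinstance(tree, str):
        return
    cat, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (cat, rhs)
    for child in children:
        yield from rules_in(child)


def expansion_probabilities(trees):
    """Relative-frequency estimate of P(rule | category) over valid parses."""
    rule_counts = Counter(r for t in trees for r in rules_in(t))
    cat_counts = Counter()
    for (cat, _), n in rule_counts.items():
        cat_counts[cat] += n
    return {r: n / cat_counts[r[0]] for r, n in rule_counts.items()}


def tree_probability(tree, probs):
    """Probability that the tree is the expansion of its root category."""
    p = 1.0
    for rule in rules_in(tree):
        p *= probs.get(rule, 0.0)   # unseen rules get probability 0 here
    return p


def relative_probabilities(candidates, probs):
    """Normalize tree probabilities over competing parses of one string."""
    scores = [tree_probability(t, probs) for t in candidates]
    total = sum(scores)
    return [s / total if total else 0.0 for s in scores]


training = [
    ("NP", ("Det", "the"), ("N", "dog")),
    ("NP", ("N", "dogs")),
]
probs = expansion_probabilities(training)
print(probs[("NP", ("Det", "N"))])
print(tree_probability(training[0], probs))
```

Because all candidate trees for a given input share the same string, normalizing their tree probabilities gives their relative posterior probabilities, which is the role Bayes' rule plays in the paragraph above.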
Rule Success Probabilities

With rule success probabilities, the goal is to determine the probability that a term inserted into the chart by a particular rule will be part of a final parse.
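One way to make this concrete is a relative-frequency estimate: for each rule, divide the number of terms it inserted that ended up in a final parse by the total number of terms it inserted. The sketch below assumes the training parser logs one `(rule, in_final_parse)` record per inserted term. That record format, the rule labels, and the estimator itself are assumptions for illustration and are not taken from the text above.

```python
from collections import Counter


def rule_success_probabilities(records):
    """Estimate P(term is in a final parse | rule that inserted it).

    records: iterable of (rule_name, in_final_parse) pairs, one per term
    inserted into the chart during training (hypothetical log format).
    """
    inserted = Counter()
    succeeded = Counter()
    for rule, in_final_parse in records:
        inserted[rule] += 1
        if in_final_parse:
            succeeded[rule] += 1
    return {rule: succeeded[rule] / inserted[rule] for rule in inserted}


log = [
    ("NP -> Det N", True),
    ("NP -> Det N", True),
    ("NP -> Det N", False),
    ("NP -> N", False),
]
success = rule_success_probabilities(log)
# Rules with higher estimates would be scheduled earlier on the agenda.
ranking = sorted(success, key=success.get, reverse=True)
print(ranking)
```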